Abstract

One of the main challenges in Zero-Shot Learning of visual categories is gathering semantic attributes to accompany images. Recent work has shown that learning from textual descriptions, such as Wikipedia articles, avoids the problem of having to explicitly define these attributes. We present a new model that can classify unseen categories from their textual description. Specifically, we use text features to predict the output weights of both the convolutional and the fully connected layers in a deep convolutional neural network (CNN). We take advantage of the architecture of CNNs and learn features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches. The proposed model also allows us to automatically generate a list of pseudo-attributes for each visual category consisting of words from Wikipedia articles. We train our models end-to-end using the Caltech-UCSD bird and flower datasets and evaluate both ROC and Precision-Recall curves. Our empirical results show that the proposed model significantly outperforms previous methods.


1. Introduction 

The recent success of deep learning approaches to object recognition is supported by the collection of large datasets with millions of images and thousands of labels [?]. Although the datasets continue to grow larger and are acquiring a broader set of categories, they are very time consuming and expensive to collect. Furthermore, collecting detailed, fine-grained annotations, such as attribute or object part labels, is even more difficult for datasets of such size.

On the other hand, there is a massive amount of textual data available online. Online encyclopedias such as English Wikipedia, which currently contains 4,856,149 articles, represent a rich knowledge base for a diverse set of topics. Ideally, one would exploit this rich source of information in order to train visual object models with minimal additional annotation.

Figure 1. A deep multi-modal neural network. The first modality corresponds to tf-idf features taken from a text corpus with a corresponding class, e.g., a Wikipedia article about a particular object. This is passed through a multi-layer perceptron (MLP) and produces a set of linear output nodes $f$. The second modality takes in an image and feeds it into a convolutional neural network (CNN). The last layer of the CNN is then passed through a linear projection to produce a set of image features $g$. The score of the class is produced via $f^\top g$. In this sense, the text pipeline can be thought of as producing a set of classifier weights for the image pipeline.

The concept of “Zero-Shot Learning” has been introduced in the literature [?] with the aim of improving the scalability of traditional object recognition systems. The ability to classify images of an unseen class is transferred from the semantically or visually similar classes that have already been learned by a visual classifier. One popular approach is to exploit shared knowledge between classes in the form of attributes, such as stripes, four legs, or roundness. There is typically a much smaller perceptual (describable) set of attributes than the number of all objects, and thus training classifiers for them is typically a much easier task. Most work pre-defines the attribute set, typically depending on the dataset used, which somewhat limits the applicability of these methods on a larger scale.

In this work, we build on the ideas of [?] and introduce a novel Zero-Shot Learning model that predicts visual classes using a text corpus, in particular, the encyclopedia corpus. The encyclopedia articles are an explicit categorization of human knowledge. Each article contains a rich implicit annotation of an object category. For example, the Wikipedia entry for “Cardinal” gives a detailed description of this bird’s distinctive visual features, such as its colors and the shape of its beak. The explicit knowledge sharing in encyclopedia articles is also apparent through their inter-references. Our model aims to generate image classifiers directly from encyclopedia articles of the classes with no training images. This overcomes the difficulty of hand-crafted attributes and the lack of fine-grained annotation. Instead of using simple word embeddings or short image captions, our model operates directly on a raw natural language corpus and image pixels.

Our first contribution is a novel framework for predicting the output weights of a classifier on both the fully connected and convolutional layers of a Convolutional Neural Network (CNN). We introduce a convolutional classifier that operates directly on the intermediate feature maps of a CNN. The convolutional classifier convolves the feature map with a filter predicted by the text description. The classification score is generated by global pooling after convolution. We also empirically explore combining features from different layers of CNNs and their effects on the classification performance.

We evaluate the common objective functions used in Zero-Shot Learning and rank-based retrieval tasks. We quantitatively compare the performance of different objective functions using ROC-AUC, mean Average-Precision and classification accuracy. We show that different cost functions outperform each other under different evaluation metrics. Evaluated on the Caltech-UCSD Bird dataset and the Oxford Flower dataset, our proposed model significantly outperforms the previous state-of-the-art Zero-Shot Learning approach [?]. In addition, the testing performance of our model on the seen classes is comparable to the state-of-the-art fine-grained classifier using additional annotations.

Finally, we show how our trained model can be used to 
automatically discover a list of class-specific attributes from 
encyclopedia articles. 


2. Related work 

2.1. Domain adaptation 

Domain adaptation concerns the problem where there are two distinct datasets, known as the source and target domains respectively. In the typical supervised setting, one is given a source training set $S \sim P_S$ and a target training set $T \sim P_T$, where $P_S \neq P_T$. The goal is to transfer information from the source domain to the target domain in order to produce a better predictor than training on the target domain alone. Unlike zero-shot learning, the class labels in domain adaptation are assumed to be known in advance and fixed.

There has been substantial work in computer vision to deal with domain adaptation. [23] and [?] address the problem mentioned above, where both source and target data are available at training time. This is extended in [?] to the unsupervised setting where target labels are not available at training time. In [27], there is no target data available; however, the set of labels is still given and is consistent across domains. In [12] the authors explicitly account for inter-dataset biases and are able to train a model that is invariant to these. [?] considered a unified formulation of domain adaptation and multi-task learning where they combine different domains using a dot-product operator.

2.2. Semantic label embedding 

Image and text embeddings are projections from the space of pixels, or the space of text, to a new space where nearest neighbours are semantically related. In semantic label embedding, image and label embeddings are jointly trained so that semantic information is shared between modalities. For example, an image of a tiger could be embedded in a space where it is near the label “tiger”, while the label “tiger” would itself be near the label “lion”.

In [29], this is accomplished via a ranking objective using linear projections of image features and bag-of-words attribute features. In [9], label features are produced by an unsupervised skip-gram model [?] trained on Wikipedia articles, while the image features are produced by a CNN trained on ImageNet [?]. This allows the model to use semantic relationships between labels in order to predict labels that do not appear in the training set. While [9] removes the final classification layer of the CNN, [20] retains it and uses the uncertainty in the classifier to produce a final embedding from a convex combination of label embeddings. [26] uses unsupervised label embeddings together with an outlier detector to determine whether a given image corresponds to a known label or a new label. This allows them to use a standard classifier when the label is known.

2.3. Zero-Shot learning from attribute vectors 

A key difference between semantic label embedding and the problem we consider here is that we do not consider the semantic relationship between labels. Rather, we assume that the labels are themselves composed of attributes and attempt to learn the semantic relationship between the attributes and images. In this way, new labels can be constructed by combining different sets of attributes. This setup has been previously considered in [?], where the attributes are manually annotated. In [16], the training set attributes are predicted along with the image label at test time. [?] explores relative attributes, which capture how images relate to each other along different attributes.

Our problem formulation is inspired by [?] in that we attempt to derive embedding features for each label directly from natural language descriptions, rather than attribute annotations. The key difference is in our architecture, where we use deep neural networks to jointly embed image and text features rather than using probabilistic regression with domain adaptation.

3. Predicting a classifier 

The overall goal of the model is to learn an image classifier from natural language descriptions. During training, our model takes a set of text features (e.g. Wikipedia articles), each representing a particular class, and a set of images for each class. At test time, some previously unseen textual descriptions (zero-shot classes) and their associated images are presented. Our model needs to classify the images from unseen visual classes against images from the trained classes. We first introduce a general framework to predict linear classifier weights and then extend the concept to convolutional classifiers.

Given a set of $N$ image feature vectors $x \in \mathbb{R}^d$ and their associated class labels $l \in \{1, \dots, C\}$, we have a training set $\mathcal{D}_{train} = \{(x^{(n)}, l^{(n)})\}_{n=1}^{N}$. There are $C$ distinct class labels available for training. At test time, we are given an additional $n_0$ previously unseen classes, such that $l_{test} \in \{1, \dots, C, \dots, C + n_0\}$, and test images $x_{test}$ associated with those unseen classes, $\mathcal{D}_{test} = \{(x_{test}^{(n)}, l_{test}^{(n)})\}_{n=1}^{N_{test}}$.

3.1. Predicting a linear classifier 

Let us consider a standard binary one-vs-all linear classifier whose score is given by

$$y_c = w_c^\top x, \qquad (1)$$

where $w_c$ is the weight vector for a particular class $c$ (we consider various loss functions of this score in Section 4). It is hard to deal with unseen classes using this standard formulation. Let us further assume that we are provided with an additional text feature vector $t_c \in \mathbb{R}^p$ associated with each class $c$. Instead of learning a static weight vector $w_c$, the text feature can be used to predict the classifier weights $w_c$. In other words, we can define $w_c$ to be a function of $t_c$ for a particular class $c$:

$$w_c = f_t(t_c), \qquad (2)$$

where $f_t : \mathbb{R}^p \mapsto \mathbb{R}^d$ is a mapping that transforms the text features to the visual image feature space. In the special case of choosing $f_t(\cdot)$ to be a linear transformation, the formulation is similar to [?]. In this work, the mapping $f_t$ is represented as a non-linear regression model that is parameterized by a neural network. Given the mapping $f_t$ and text features for a new class, we can extend the one-vs-all linear classifier to the previously unseen classes.
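As an illustration of Eqs. (1) and (2), the sketch below scores an image against a class that has only a text feature. It is a minimal toy, not the authors' implementation: the sizes, the random untrained weights `W1` and `W2`, the single ReLU hidden layer without biases, and the helper names are all assumptions made for this example.

```python
import random

random.seed(0)
p, d, hidden = 6, 4, 5  # toy sizes; the paper's p and d are in the thousands

def rand_matrix(rows, cols):
    """Small random weight matrix (stands in for trained parameters)."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def relu(v):
    return [max(0.0, x) for x in v]

W1, W2 = rand_matrix(hidden, p), rand_matrix(d, hidden)

def f_t(t_c):
    """Map a text feature t_c in R^p to classifier weights w_c in R^d (Eq. 2)."""
    return matvec(W2, relu(matvec(W1, t_c)))

def score(x, t_c):
    """Eq. (1), with the static w_c replaced by the predicted f_t(t_c)."""
    w_c = f_t(t_c)
    return sum(w * xi for w, xi in zip(w_c, x))

t_unseen = [1.0, 0.0, 2.0, 0.0, 0.0, 1.0]  # text feature of a class with no training images
x = [0.5, -1.0, 0.3, 2.0]                  # image feature vector
print(len(f_t(t_unseen)), score(x, t_unseen))
```

Because the weights come from `f_t` rather than from a per-class lookup table, adding a new class only requires its text feature.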

3.2. Predicting the output weights of neural nets 

One of the drawbacks of having a direct mapping from $\mathbb{R}^p$ to $\mathbb{R}^d$ is that both $\mathbb{R}^p$ and $\mathbb{R}^d$ are typically high dimensional, which makes it difficult to estimate the large number of parameters in $f_t(\cdot)$. For example, in the linear transformation setup, the number of parameters in $f_t(\cdot)$ is proportional to $O(d \times p)$. For the problems considered in the paper, this implies that millions of parameters need to be estimated from only a few thousand data points. In addition, most of the parameters are highly correlated, which makes gradient based optimization methods converge slowly.

Instead, we introduce a second mapping parameterized by a multi-layer neural network $g_v : \mathbb{R}^d \mapsto \mathbb{R}^k$ that transforms the visual image features $x$ to a lower dimensional space, where $k \ll d$. The dimensionality of the predicted weight vector $w_c$ can be drastically reduced using $g_v(\cdot)$. The new formulation for the binary classifier can be written as:

$$y_c = w_c^\top g_v(x), \qquad (3)$$

where the transformed image feature $g_v(x)$ is the output of a neural network. Similar to Eq. (2), $w_c \in \mathbb{R}^k$ is predicted using the text features $t_c$ with $f_t : \mathbb{R}^p \mapsto \mathbb{R}^k$. Therefore, the formulation in Eq. (3) is equivalent to a binary classification neural network whose output weights are predicted from text features. Using neural networks, both $f_t(\cdot)$ and $g_v(\cdot)$ perform non-linear dimensionality reduction of the text and visual features. In the special case where both $f(\cdot)$ and $g(\cdot)$ are linear transformations, Eq. (3) is equivalent to the low rank matrix factorization [?]. A visualization of this model is shown in Figure 1.
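To make the parameter argument concrete, the following back-of-the-envelope count compares a direct linear text-to-image map with two small networks that meet in a k-dimensional space. The sizes are the ones quoted elsewhere in the paper (4096-dimensional image features, 9763-dimensional tf-idf vectors, k = 50, 300 hidden units); ignoring biases is a simplification made here.

```python
# Sizes quoted in the paper's experimental section.
d, p, k, hidden = 4096, 9763, 50, 300

# Direct linear map from R^p to R^d: one weight per (input, output) pair.
direct_params = d * p

# Factored alternative: text network p-300-k plus image network d-300-k.
text_params = p * hidden + hidden * k
image_params = d * hidden + hidden * k
factored_params = text_params + image_params

print(direct_params)    # 39989248
print(factored_params)  # 4187700
```

Under these assumptions the factored model has roughly a tenth of the weights of the direct map, and the vector that must be predicted from text shrinks from 4096 to 50 entries.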

3.3. Predicting a convolutional classifier 

Convolutional neural networks (CNNs) are currently the most accurate models for object recognition tasks [?]. In contrast to traditional hand-engineered features, CNNs build a deep hierarchical multi-layer feature representation from raw image pixels. It is common to boost the performance of a vision system by using the features from the fully connected layer of a CNN [?]. Although the image features obtained from the top fully connected layer of CNNs are useful for generic vision pipelines, there is very little spatial and local information retained in them. The feature maps from the lower convolutional layers, on the other hand, contain local features arranged in a spatially coherent grid. In addition, the weights in a convolutional layer are locally connected and shared across the feature map. The number of trainable weights is therefore far smaller than in the fully connected layers. It is thus appealing to predict the convolutional filters using text features due to the relatively small number of parameters.

Let $a$ denote the extracted activations from a convolutional layer with $M$ feature maps, where $a \in \mathbb{R}^{M \times w \times h}$ with $a_i$ representing the $i$-th feature map of $a$, and $w$, $h$ denoting the width and height of a feature map. Unlike previous approaches, we directly formulate a convolutional classifier using the feature maps from convolutional layers. First, we perform a non-linear dimensionality reduction to reduce the number of feature maps as in Sec. 3.2. Let $g'_v(\cdot)$ be a reduction mapping $g'_v : \mathbb{R}^{M \times w \times h} \mapsto \mathbb{R}^{K' \times w \times h}$ where $K' \ll M$. The reduced feature map is then defined as $a' = g'_v(a)$. Given the text features $t_c$ for a particular class $c$, we have the corresponding predicted convolutional weights $w'_c = f'_t(t_c)$, where $w'_c \in \mathbb{R}^{K' \times s \times s}$ and $s$ is the size of the predicted filter. Similarly to the fully connected model, $f'_t(\cdot)$ is parameterized by a multi-layer neural network. We can formulate a convolutional classifier as follows:

$$y'_c = o\Big(\sum_{i=1}^{K'} w'_{c,i} * a'_i\Big), \qquad (4)$$

where $o(\cdot)$ is a global pooling function such that $o : \mathbb{R}^{w \times h} \mapsto \mathbb{R}$ and $*$ denotes the convolution that is typically used in convolutional layers. By convolving the predicted weights over the feature maps, we encourage the model to learn informative location feature detectors based on textual descriptions. The global pooling operation $o(\cdot)$ aggregates the local features over the whole image and produces the score. Depending on the type of the pooling operation, such as noisy-or, average pooling or max pooling, the convolutional classifier will have different sensitivities to local features. In our experimental results, we found that average pooling works well in general while max pooling suffers from over-fitting.

3.4. Predicting a joint classifier

We can also take advantage of the CNN architecture by using features extracted from both the intermediate convolutional layers and the final fully connected layer. Given the convolutional feature $a$ and the fully connected feature $x$ after propagating the raw image through the CNN, we can write down the joint classification model as:

$$y_c = w_c^\top g_v(x) + o\Big(\sum_{i=1}^{K'} w'_{c,i} * a'_i\Big). \qquad (5)$$

Both the convolutional weights $w'_c$ and the fully connected weights $w_c$ are predicted from the text feature $t_c$ using a single multi-task neural network with shared layers.


4. Learning

The mapping functions $f(\cdot)$ and $g(\cdot)$, which transform the text features into classifier weights and the image features into a lower dimensional space, are neural networks that are parameterized by a matrix $W$. The goal of learning is to adjust $W$ so that the model can accurately classify images based on a textual description. Let us consider a training set containing $C$ textual descriptions (e.g. $C$ Wikipedia articles), one for each class $c$, and $N$ images. We next examine the following two objective functions for training our model.

4.1. Binary Cross Entropy

For an image feature $x_i$ and a text feature $t_j$, an indicator $I_{ij}$ is used to encode whether the image corresponds to the class represented by the text using a 0-1 encoding. The binary cross entropy is the most intuitive objective function for our predicted binary classifier:

$$\mathcal{L}(W) = \sum_{i=1}^{N} \sum_{j=1}^{C} \Big[ I_{ij} \log \sigma\big(y_j(x_i, t_j)\big) + (1 - I_{ij}) \log\big(1 - \sigma(y_j(x_i, t_j))\big) \Big], \qquad (6)$$

where $\sigma$ is the sigmoid function $\sigma(y) = 1/(1 + e^{-y})$. In the above equation, each image is evaluated against all $C$ classes during training, which becomes computationally expensive as the number of classes grows. Instead, we use a Monte Carlo minibatch scheme to approximate the summation over all images and all classes in Eq. (6). Namely, we draw a minibatch of $B$ images and compute the cost by summing over the images in the minibatch. We also sum only over the image labels that appear in the minibatch. The computational cost of this minibatch scheme is $O(B \times B)$, instead of $O(N \times C)$.
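The minibatch scheme for the binary cross entropy can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: `score_fn` is a hypothetical stand-in for the class score $y_j(x_i, t_j)$, the loss is written as the negative of the summed log-likelihood (the usual sign convention for minimization), and no numerical clipping is applied.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def minibatch_bce(score_fn, images, labels, text_feats):
    """Binary cross entropy over one minibatch.

    images: list of B image features; labels: their class ids;
    text_feats: dict mapping class id -> text feature.
    Only classes whose label occurs in the minibatch are scored,
    so at most B x B (image, class) pairs are evaluated.
    """
    batch_classes = sorted(set(labels))
    total = 0.0
    for x, label in zip(images, labels):
        for c in batch_classes:
            prob = sigmoid(score_fn(x, text_feats[c]))
            target = 1.0 if c == label else 0.0
            total -= target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
    return total

# Toy usage: a dot product plays the role of the predicted classifier score.
dot = lambda x, t: sum(a * b for a, b in zip(x, t))
images = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
labels = [0, 1, 1]
text_feats = {0: [2.0, -2.0], 1: [-2.0, 2.0], 2: [0.0, 0.0]}
print(minibatch_bce(dot, images, labels, text_feats))
```

Class 2 never appears among the minibatch labels, so its text feature is never scored; this is what keeps the cost independent of the total number of classes.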


4.2. Hinge Loss 

We further considered a hinge loss objective. Hinge loss objective functions are the most popular among the retrieval and ranking tasks for multi-modal data. In fact, predicting the output layer weights of a neural network (see Sec. 3.2) can be formulated as a ranking task between text descriptions and visual images. Although the formulation is similar, the focus of this work is on classification rather than information retrieval. Let the indicator $I_{ij}$ represent a $\{1, -1\}$ encoding for the positive and negative class. We can then use the following simple hinge loss objective function:

$$\mathcal{L}(W) = \sum_{i=1}^{N} \sum_{j=1}^{C} \max\big(0,\ \epsilon - I_{ij}\, y_j(x_i, t_j)\big). \qquad (7)$$

Here, $\epsilon$ is the margin, which is typically set to 1. This hinge loss objective encourages the classifier score $y$ to be higher for the correct text description and lower for other classes. Similarly to Sec. 4.1, a minibatch method can be adopted to train the hinge loss objective function efficiently.
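A minimal sketch of such a hinge objective is given below, assuming the standard form in which each (image, class) pair contributes max(0, margin - I * y) with I = +1 for the true class and -1 otherwise. The function name and the toy score table are made up for illustration.

```python
def hinge_loss(scores, labels, margin=1.0):
    """scores[i][j]: score of image i under class j; labels[i]: true class of image i."""
    total = 0.0
    for i, row in enumerate(scores):
        for j, y in enumerate(row):
            sign = 1.0 if j == labels[i] else -1.0  # the {1, -1} indicator
            total += max(0.0, margin - sign * y)
    return total

# Image 0 is confidently correct and contributes nothing;
# image 1 is right but inside the margin, so both of its terms are penalised.
scores = [[2.0, -1.5], [0.2, 0.4]]
labels = [0, 1]
print(hinge_loss(scores, labels))
```

Pairs that already satisfy the margin contribute exactly zero, so only the ambiguous image drives the gradient.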

4.2.1 Euclidean Distance 

The Euclidean distance loss function was previously used in [26] with a fixed pre-learnt word embedding. Such a cost function can be obtained from our classifier formulation by expanding the Euclidean distance $-\frac{1}{2}\|a - b\|_2^2 = a^\top b - \frac{1}{2}\|a\|_2^2 - \frac{1}{2}\|b\|_2^2$. Minimizing the hinge loss in Eq. (7) with the additional negative $L_2$ norm of both $w_c$ and $g_v(x)$ is equivalent to minimizing their Euclidean distance. The hinge loss prevents the infinite penalty on the negative examples when jointly learning an embedding of class text descriptions and their images.
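For reference, the expansion behind this equivalence can be written out term by term. Taking the two vectors to be the predicted weights $w_c$ and the transformed image feature $g_v(x)$ is our reading of the text, not something the passage states explicitly:

$$
\begin{aligned}
-\tfrac{1}{2}\,\|w_c - g_v(x)\|_2^2
  &= -\tfrac{1}{2}\big(w_c^\top w_c - 2\, w_c^\top g_v(x) + g_v(x)^\top g_v(x)\big) \\
  &= \underbrace{w_c^\top g_v(x)}_{\text{classifier score}}
     \;-\; \tfrac{1}{2}\|w_c\|_2^2 \;-\; \tfrac{1}{2}\|g_v(x)\|_2^2 .
\end{aligned}
$$

Under this reading, the negative squared distance is the dot-product score of Sec. 3.2 minus two norm penalties, so using it as the score in the hinge loss pulls matching text and image embeddings together.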

5. Experiments 

In this section we empirically evaluate our proposed models and various objective functions. The fc (fully-connected) model corresponds to Sec. 3.2, where the text features are used to predict the fully-connected output weights of the image classifier. The conv model is the convolutional classifier in Sec. 3.3 that predicts the convolutional filters for CNN feature maps. The joint model is denoted as fc+conv. We evaluate the predicted zero-shot binary classifier on test images from both unseen and seen classes. The evaluation for Zero-Shot Learning performance varies widely throughout the literature. We report our model performance using the most common metrics:

ROC-AUC: This is one of the most commonly used metrics for binary classification. We compute the receiver operating characteristic (ROC) curve of our predicted binary classifier and evaluate the area under the ROC curve.

PR-AUC (AP): It has been pointed out in [?] that for datasets where the numbers of positive and negative samples are imbalanced, the precision-recall curve has been shown to be a better metric than ROC. PR-AUC is computed by trapezoidal integration of the area under the PR curve. PR-AUC is also called average precision (AP).

Top-K classification accuracy: Although all of our models can be viewed as binary classifiers, one for each class, the multi-class classification accuracy can be computed by evaluating the given test image on text descriptions from all classes and sorting the final prediction scores $y_c$.
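Two of these metrics are simple enough to sketch directly. The code below is an illustrative implementation, not the authors' evaluation script: ROC-AUC is computed through its pairwise-ranking interpretation (the fraction of positive/negative pairs ordered correctly, ties counted as one half), and top-K accuracy sorts each image's per-class scores. PR-AUC is omitted because its value depends on interpolation conventions the text does not specify.

```python
def roc_auc(scores, targets):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, targets) if t == 1]
    neg = [s for s, t in zip(scores, targets) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def top_k_accuracy(score_rows, labels, k):
    """score_rows[i][c]: score of image i under class c's predicted classifier."""
    hits = 0
    for row, label in zip(score_rows, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)

print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
rows = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
print(top_k_accuracy(rows, [2, 0], k=1))             # 0.5
print(top_k_accuracy(rows, [2, 0], k=2))             # 1.0
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; a sort-based rank statistic gives the same value more efficiently.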


5.1. Training Procedure 

In all of our experiments, image features are extracted by running the 19 layer VGG model pre-trained on ImageNet without fine-tuning. Specifically, to create the image features for the fully connected classifier, we used the activations from the last fully connected 4096 dimension hidden layer fcl. The convolutional features are generated using the 512x14x14 feature maps from the conv5_3 layer. In addition, images are preprocessed similarly to [?] before being fed into the VGG net. In particular, each image is resized so that its shortest dimension is 224 pixels. A center patch of 224x224 is then cropped from the resized image.
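The resize-and-crop geometry can be made explicit with a small helper. This sketch only computes the target size and crop box; the rounding rule and the function name are assumptions for this example, and the actual pixel resampling would be delegated to an image library.

```python
def resize_and_crop_box(width, height, target=224):
    """Return the resized (w, h) and the centre-crop box (left, top, right, bottom)."""
    scale = target / min(width, height)          # shortest side becomes `target`
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

print(resize_and_crop_box(500, 375))  # ((299, 224), (37, 0, 261, 224))
```

For a 500x375 landscape image the height becomes 224, the width scales to 299, and the crop discards about 37 columns on each side.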

Various components of our models are parameterized by ReLU neural nets of different sizes. The transformation function for textual features $f_t(\cdot) : \mathbb{R}^p \mapsto \mathbb{R}^k$ is parameterized by a two-hidden layer fully-connected neural network whose architecture is p-300-k, where $p$ is the dimensionality of the text feature vectors and $k = 50$ is the size of the predicted weight vector $w_c$ for the fully connected layer. The image features from the fcl layer of the VGG net are fed into the visual mapping $g_v(\cdot)$. This architecture is 4096-300-k. The features $a \in \mathbb{R}^{M \times w \times h}$ from the intermediate conv layer are first transformed by a conv layer $g'_v(\cdot)$ with $K'$ filters of size 3 x 3, where we set $K' = 5$. The final $a' \in \mathbb{R}^{K' \times w \times h}$ from Eq. (4) are convolved with the $K' \times 3 \times 3$ filters predicted from the 300 unit hidden layer of $f_t(\cdot)$.
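The final scoring step of the convolutional classifier, with the shapes quoted above (K' = 5 reduced feature maps of size 14x14 and a predicted 5x3x3 filter), can be sketched as below. Several details are assumptions made for this example rather than facts from the paper: 'valid' boundary handling, cross-correlation instead of flipped convolution (as is usual in deep learning libraries), average pooling as the global pooling function, and constant toy inputs.

```python
def conv_classifier_score(a_reduced, w_pred):
    """Average-pooled response of a text-predicted filter on reduced feature maps.

    a_reduced: K' feature maps, each H x W (nested lists).
    w_pred:    K' filter slices, each s x s, predicted from the text feature.
    The per-map responses are summed over the K' maps, then averaged over positions.
    """
    K = len(a_reduced)
    H, W = len(a_reduced[0]), len(a_reduced[0][0])
    s = len(w_pred[0])
    out_h, out_w = H - s + 1, W - s + 1   # 'valid' output size
    total = 0.0
    for r in range(out_h):
        for c in range(out_w):
            for i in range(K):
                for dr in range(s):
                    for dc in range(s):
                        total += w_pred[i][dr][dc] * a_reduced[i][r + dr][c + dc]
    return total / (out_h * out_w)

K, H, W, s = 5, 14, 14, 3
a_reduced = [[[1.0] * W for _ in range(H)] for _ in range(K)]  # toy constant maps
w_pred = [[[0.1] * s for _ in range(s)] for _ in range(K)]     # toy predicted filter
print(conv_classifier_score(a_reduced, w_pred))
```

The predicted filter has only 5 x 3 x 3 = 45 entries per class, which illustrates why predicting convolutional weights from text involves so few parameters.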

Adam [?] is used to optimize our proposed models with minibatches of 200 images. We found that SGD does not work well for our proposed models. This is potentially due to the difference in magnitude between the sparse gradients of the text features and the dense gradients in the convolutional layers. This problem is avoided by using adaptive step sizes.

Our model implementation is based on the open-source package Torch [?]. The training time for the fully connected model is 1-2 hours on a GTX Titan, whereas the joint fc+conv model takes 4 hours to train.

5.2. Caltech UCSD Bird 

The 200-category Caltech UCSD bird dataset [?] is one of the most widely used and competitive fine-grained classification benchmarks. We evaluated our method on both the CUB200-2010 and CUB200-2011 versions of the bird dataset. Instead of using semantic parts and attributes as in the common approaches for CUB200, we only used the raw images and Wikipedia articles [?] to train our models.

There is one Wikipedia article associated with each bird class and 200 articles in total. The average number of words in the articles is around 400. Each Wikipedia article is transformed into a 9763-dimensional Term Frequency-Inverse Document Frequency (tf-idf) feature vector. We noticed that log normalization for the term frequency is helpful, as article length varies substantially across classes.

| Dataset | Model | ROC-AUC unseen | ROC-AUC seen | ROC-AUC mean | PR-AUC unseen | PR-AUC seen | PR-AUC mean |
|---|---|---|---|---|---|---|---|
| CU-Bird200-2010 | DA (baseline feat.) [5] | 0.59 | - | - | - | - | - |
| CU-Bird200-2010 | DA+GP [?] (baseline feat.) | 0.62 | - | - | - | - | - |
| CU-Bird200-2010 | DA [?] (VGG feat.) | 0.66 | 0.69 | 0.68 | 0.037 | 0.11 | 0.094 |
| CU-Bird200-2010 | Ours (fc baseline feat.) | 0.69 | 0.93 | 0.85 | 0.09 | 0.20 | 0.19 |
| CU-Bird200-2010 | Ours (fc) | 0.82 | 0.96 | 0.934 | 0.10 | 0.41 | 0.35 |
| CU-Bird200-2010 | Ours (conv) | 0.73 | 0.96 | 0.91 | 0.043 | 0.34 | 0.28 |
| CU-Bird200-2010 | Ours (fc+conv) | 0.80 | 0.987 | 0.95 | 0.08 | 0.53 | 0.43 |
| CU-Bird200-2011 | Ours (fc) | 0.82 | 0.974 | 0.943 | 0.11 | 0.33 | 0.286 |
| CU-Bird200-2011 | Ours (conv) | 0.80 | 0.96 | 0.925 | 0.085 | 0.15 | 0.14 |
| CU-Bird200-2011 | Ours (fc+conv) | 0.85 | 0.98 | 0.953 | 0.13 | 0.37 | 0.31 |
| Oxford Flower | DA (baseline feat.) [5] | 0.62 | - | - | - | - | - |
| Oxford Flower | GPR+DA (baseline feat.) [?] | 0.68 | - | - | - | - | - |
| Oxford Flower | Ours (fc baseline feat.) | 0.63 | 0.96 | 0.86 | 0.055 | 0.60 | 0.45 |
| Oxford Flower | Ours (fc) | 0.70 | 0.987 | 0.90 | 0.07 | 0.65 | 0.52 |
| Oxford Flower | Ours (conv) | 0.65 | 0.97 | 0.85 | 0.054 | 0.61 | 0.46 |
| Oxford Flower | Ours (fc+conv) | 0.71 | 0.989 | 0.93 | 0.067 | 0.69 | 0.56 |

Table 1. ROC-AUC and PR-AUC (AP) performance compared to other methods. The performance is shown for both the zero-shot unseen classes and test data of the seen training classes. The class averaged mean AUCs are also included. For both ROC-AUC and PR-AUC, we report the best numbers obtained among the models trained on different objective functions.
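A tf-idf featurizer with log-normalized term frequency might look like the sketch below. The paper does not state which exact weighting variant was used, so the formulas here (tf = 1 + ln(count) for present terms, idf = ln(N / df)) are one common choice and should be read as an assumption, as should the function name and the toy documents.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one tf-idf vector per document)."""
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = []
        for w in vocab:
            # Log-normalised term frequency damps the effect of article length.
            tf = 1.0 + math.log(counts[w]) if counts[w] > 0 else 0.0
            idf = math.log(n_docs / doc_freq[w])
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors

docs = [["red", "beak", "red"], ["blue", "beak"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)    # ['beak', 'blue', 'red']
print(vectors)
```

With the logarithm, a word that appears twice as often in a long article gains only a constant increment in weight rather than a doubled one, which is one plausible reason the normalization helps when article lengths differ.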

The CUB200-2010 dataset contains 6033 images from 200 different bird species. There are around 30 images per class. We follow the same protocol as in [?], using a random split of 40 classes as unseen and the remaining 160 classes as seen. Among the seen classes, we further allocate 20% of the images for testing and 80% of the images for training. There are around 3600 images for training and 2500 images for testing. 5-fold cross-validation is used to evaluate the performance.

In order to compare with the previously published results, we first evaluated our model using image and text features from [?]. Since there are no image features with spatial information, we are only predicting the fully connected weights. Visual features are first fed into a two-hidden layer neural net with 300 and 50 hidden units in the first and second layers. We used their processed text features to predict the 50 dimensional fully connected classifier weights with a two-hidden layer neural net. A baseline Domain Adaptation [?] method is also evaluated using the features from the VGG fcl layer.

The CUB200-2011 dataset is an updated version of CUB200-2010 where the number of images is increased to 11,788. The 200 bird classes are the same as in the 2010 version, but with the number of training cases doubled for each class. We used the same experimental setup and Wikipedia articles as for the 2010 version.


5.3. Oxford Flower 

The Oxford Flower-102 dataset [?] contains 102 classes with a total of 8189 images. The flowers were chosen from common flower species in the United Kingdom. Each class contains around 40 to 260 images. We used the same raw text corpus as in [?]. The experimental setup is similar to CUB200, where 82 flower classes are used for training and 20 classes are used as unseen during testing. Similar to the CUB200-2010 dataset, we compared our method to the previously published results using the same visual and text features.

5.4. Overall results 

Our results on the Caltech UCSD Bird and Oxford Flower datasets, shown in Table 1, dramatically improve upon the state-of-the-art for zero-shot learning. This demonstrates that our deep approach is capable of producing highly discriminative feature vectors based solely on natural language descriptions. We also find that predicting convolutional filters (conv) and a hybrid approach (fc+conv) further improves model performance.

5.5. Effect of objective functions 

We studied the model performance across the different objective functions from Sec. 4. The evaluation is shown in Table 2. The models trained with binary cross entropy (BCE) have a good balance between ROC-AUC, PR-AUC and classification accuracy. The models trained with the hinge loss consistently outperform the others on the PR-AUC metric. However, the hinge loss models do not perform well on top-K classification accuracy on the zero-shot classes compared to other loss functions. The Euclidean distance model seems to perform well on the unseen classes while achieving a much lower accuracy on the seen classes. BCE shows the best overall performance across the three metrics.

Figure 2. [LEFT]: Word sensitivities of unseen classes using the fc model on CUB200-2010. The dashed lines correspond to the test-set PR-AUC for each class. TF-IDF entries are then independently set to 0 and the five words that most reduce the PR-AUC are shown in each bar chart. Approximately speaking, these words can be considered to be important attributes for these classes. [RIGHT]: The Wikipedia article for each class is projected onto its feature vector w and the nearest image neighbors from the test-set (in terms of maximal dot product) are shown. The within-class nearest neighbors only consider images of the same class, while the overall nearest neighbors consider all test-set images.

| Metrics | BCE | Hinge | Euclidean |
|---|---|---|---|
| unseen ROC-AUC | 0.82 | 0.795 | 0.70 |
| seen ROC-AUC | 0.973 | 0.97 | 0.95 |
| mean ROC-AUC | 0.937 | 0.934 | 0.90 |
| unseen PR-AUC | 0.103 | 0.10 | 0.076 |
| seen PR-AUC | 0.33 | 0.41 | 0.37 |
| mean PR-AUC | 0.287 | 0.35 | 0.31 |
| unseen class acc. | 0.01 | 0.006 | 0.12 |
| seen class acc. | 0.35 | 0.43 | 0.263 |
| mean class acc. | 0.17 | 0.205 | 0.19 |
| unseen top-5 acc. | 0.176 | 0.182 | 0.428 |
| seen top-5 acc. | 0.58 | 0.668 | 0.45 |
| mean top-5 acc. | 0.38 | 0.41 | 0.44 |

Table 2. Model performance using various objective functions on the CUB-200-2010 dataset. The numbers are reported by training the fully-connected models.


| Metrics | Conv5_3 | Conv4_3 | Pool5 |
|---|---|---|---|
| mean ROC-AUC | 0.91 | 0.6 | 0.82 |
| mean PR-AUC | 0.28 | 0.09 | 0.173 |
| mean top-5 acc. | 0.25 | 0.153 | 0.02 |

Table 3. Performance comparison using different intermediate conv layers from the VGG net on the CUB-200-2010 dataset. The numbers are reported by training the joint fc+conv models.

5.6. Effect of convolutional features 

The convolutional classifier and the joint fc+conv model operate on the feature maps extracted from CNNs. Recent work [?] has shown that using features from convolutional layers is beneficial over just using the final fully connected layer features of a CNN. We evaluate the performance of our convolutional classifier using features from different intermediate convolutional layers in the VGG net and report the results in Table 3. The features from the conv5_3 layer are more discriminative than those from the lower conv4_3 layer.

5.7. Learning on the full datasets 

Similar to traditional classification models, our proposed method can be used for object recognition by training on the entire dataset. The results after fine-tuning are shown in Table 4.


Model / Dataset   CUB-2010   CUB-2011   OxFlower
Ours (fc)         0.60       0.64       0.73
Ours (fc+conv)    0.62       0.66       0.77


Table 4. Performance of our model trained on the full dataset.
A 50/50 split is used for each class.


5.8. Visualizing the learned attributes and text representations

Our proposed model learns to discriminate between unseen
classes from text descriptions with no additional information.
In contrast, more traditional zero-shot learning
pipelines often involve a list of hand-engineered attributes.
Here we assume that only text descriptions and images are
given to our model. The goal is to generate a list of attributes
for a particular class based on its text description.

Figure 3, left panel, shows the word sensitivity of three unseen
classes on the CUB200-2010 test set using the fc model.
For each word that appears in these articles, we set the
corresponding tf-idf entry to 0 and measure the change in PR-AUC.
We multiply the modified vector by the ratio of the L2 norms of
the tf-idf vectors before and after deletion to ensure that the
network sees the same total input magnitude. The words that
result in the largest decrease in PR-AUC are deemed to be
the most important words (approximately speaking) for the
unseen class.
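
The deletion procedure described above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: PR-AUC is approximated by average precision, the rescaling is interpreted as restoring the original L2 norm after an entry is zeroed, and `score_fn` is a hypothetical stand-in for the trained model that maps a tf-idf vector to one score per test image. The toy arrays are invented for the example.

```python
import numpy as np

def average_precision(labels, scores):
    """Average precision, used here as an estimate of the area under the PR curve."""
    order = np.argsort(-np.asarray(scores))
    ranked = np.asarray(labels)[order]
    hits = np.cumsum(ranked)
    precision = hits / np.arange(1, len(ranked) + 1)
    return float((precision * ranked).sum() / max(ranked.sum(), 1))

def word_sensitivity(tfidf, score_fn, labels):
    """Drop in PR-AUC when each nonzero tf-idf entry is zeroed in turn.

    tfidf: 1-D tf-idf vector of one class article.
    score_fn: hypothetical callable, tf-idf vector -> per-image scores.
    labels: binary array, 1 where a test image belongs to the class.
    """
    base = average_precision(labels, score_fn(tfidf))
    norm_before = np.linalg.norm(tfidf)
    drops = {}
    for j in np.flatnonzero(tfidf):
        x = tfidf.copy()
        x[j] = 0.0
        norm_after = np.linalg.norm(x)
        if norm_after > 0:
            # Assumed reading of the text: rescale to the original input magnitude.
            x *= norm_before / norm_after
        drops[int(j)] = base - average_precision(labels, score_fn(x))
    return base, drops

# Toy example (hypothetical): 4 test images, 3-word vocabulary, linear scorer.
features = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.1, 1.0, 0.0],
                     [0.0, 0.8, 0.0]])
labels = np.array([1, 1, 0, 0])
tfidf = np.array([1.0, 0.5, 0.0])
base, drops = word_sensitivity(tfidf, lambda t: features @ t, labels)
most_important = max(drops, key=drops.get)
```

In this toy setting, removing word 0 reorders the images and lowers the average precision, while removing word 1 leaves the ranking unchanged, so word 0 would be reported as the more sensitive word.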

In some cases the type of bird, such as “tanager”, is
an important feature. In other cases, physically descriptive
words such as “purplish” are important. In other cases,
non-descriptive words such as “variable” are found to be important,
perhaps due to their rarity in the corpus. The collection
of sensitive words can be thought of as pseudo-attributes for
each class.

In Figure 3, right panel, we show the ability of the text
features to describe visual features. For the three unseen
classes, we use the text pipeline of the fc model to produce
a set of weights, and then search the test set to find the
images whose features have the highest dot product with
these weights. If we restrict the set of images to within the
unseen class, we get the test image that is most highly
correlated with its textual description. When we allow the images
to span the entire set of classes, we see that the resulting
images show birds that have very similar physical characteristics
to the birds in the unseen classes. This implies that the
text descriptions are informative of physical characteristics,
and that the model is able to produce a semantically meaningful
joint embedding. More examples of these neighborhood
queries can be found in the supplementary material.
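
The neighborhood query described above amounts to ranking image features by their dot product with the text-predicted weights. A minimal sketch, assuming the weights and image features are already available as arrays; `nearest_images`, the `candidates` mask, and the toy data are illustrative names and values, not part of the original model code.

```python
import numpy as np

def nearest_images(w, features, k=5, candidates=None):
    """Indices of the k images whose features have the largest dot product with w.

    w: (d,) classifier weights predicted from a class's text description.
    features: (num_images, d) image feature vectors from the test set.
    candidates: optional boolean mask restricting the search, e.g. to one class.
    """
    scores = features @ w
    idx = np.arange(len(scores))
    if candidates is not None:
        idx = idx[candidates]
    ranked = idx[np.argsort(-scores[idx])]
    return ranked[:k]

# Toy example (hypothetical): 5 test images with 2-D features, two classes.
features = np.array([[0.2, 1.0],
                     [0.9, 0.0],
                     [0.5, 0.5],
                     [0.8, 0.1],
                     [0.1, 0.3]])
image_class = np.array([0, 1, 0, 1, 0])
w = np.array([1.0, 0.0])

overall = nearest_images(w, features, k=2)
within_class = nearest_images(w, features, k=2, candidates=(image_class == 0))
```

Passing a class mask gives the within-class neighbors, while omitting it gives the overall neighbors across all test-set images.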

6. Limitations 

Although our proposed method shows significant improvement
on ROC-AUC over the previous method,
the multi-class recognition performance on the zero-shot
classes, e.g. around 10% top-1 accuracy on CUBird, is still
lower than some of the attribute-based methods. It may
be possible to take advantage of the discovered attribute
list from Sec. 5.8 to refine our classification performance.
Namely, one may infer an attribute list for each class and
learn a second-stage attribute classification model. We leave
this for future work.

7. Conclusion 

We introduced a flexible Zero-Shot Learning model that
learns to predict unseen image classes from encyclopedia
articles. We used a deep neural network to map raw text and
image pixels to a joint embedding space. This can be interpreted
as using a natural language description to produce a
set of classifier weights for an object recognition network.

We further exploited the structure of CNNs by incorporating
both the intermediate convolutional feature maps
and the feature vector from the last fully-connected layer. We
showed that our method significantly outperforms previous
zero-shot methods on the ROC-AUC metric and substantially
improves upon the current state-of-the-art on the CUBird
and Oxford Flower datasets using only raw images and
text articles. We found that the network was able to learn
pseudo-attributes from articles to describe different classes,
and that the text embeddings captured useful semantic
information in the images.

In future work, we plan to replace the tf-idf feature
extraction with an LSTM recurrent neural network.
LSTMs have been found to be effective models for learning
representations from text.

8. Appendix

In the following figures, we first show more examples of
the neighborhood queries for the fc models. We also visualize
the predicted convolutional filters of the conv models on
both the CUBird and Oxford Flower datasets. The predicted
convolutional filters from Section 3.3 are visualized by
projecting them back to image space through the 19-layer
VGG ImageNet model.

Figure 3. [LEFT]: Word sensitivities of unseen classes using the fc model on CUB200-2010. The dashed lines correspond to the test-set PR-AUC
for each class. TF-IDF entries are then independently set to 0 and the five words that most reduce the PR-AUC are shown in each bar
chart. Approximately speaking, these words can be considered to be important attributes for these classes. [RIGHT]: The Wikipedia article
for each class is projected onto its feature vector w and the nearest image neighbors from the test-set (in terms of maximal dot product)
are shown. The within-class nearest neighbors only consider images of the same class, while the overall nearest neighbors consider all
test-set images.

Figure 4. [LEFT]: Word sensitivities of unseen classes using the fc model on Oxford Flower. The dashed lines correspond to the test-set PR-AUC
for each class. TF-IDF entries are then independently set to 0 and the five words that most reduce the PR-AUC are shown in each bar
chart. Approximately speaking, these words can be considered to be important attributes for these classes. [RIGHT]: The Wikipedia article
for each class is projected onto its feature vector w and the nearest image neighbors from the test-set (in terms of maximal dot product)
are shown. The within-class nearest neighbors only consider images of the same class, while the overall nearest neighbors consider all
test-set images.

Figure 5. [LEFT]: Example images of unseen classes and their encyclopedia articles. [RIGHT]: Visualization of the predicted convolutional
filter w'_c for the unseen article on the left, using the conv model trained on CUB200-2010. The top 5 images with the highest
activations for the predicted conv filters in the validation dataset (including both seen and unseen classes) are shown. The filter
visualizations under the images are generated by the deconvolution technique. The highest activation in the predicted convolutional
classifier is projected back into the image space. Best viewed in electronic version.

Figure 6. [LEFT]: Example images of unseen classes and their encyclopedia articles. [RIGHT]: Visualization of the predicted convolutional
filter w'_c for the unseen article on the left, using the conv model trained on CUB200-2010. The top 5 images with the highest
activations for the predicted conv filters in the validation dataset (including both seen and unseen classes) are shown. The filter
visualizations under the images are generated by the deconvolution technique. The highest activation in the predicted convolutional
classifier is projected back into the image space. Best viewed in electronic version.

Figure 7. [LEFT]: Example images of unseen classes and their encyclopedia articles. [RIGHT]: Visualization of the predicted convolutional
filter w'_c for the unseen article on the left, using the conv model trained on Oxford Flower. The top 5 images with the highest
activations for the predicted conv filters in the validation dataset (including both seen and unseen classes) are shown. The filter
visualizations under the images are generated by the deconvolution technique. The highest activation in the predicted convolutional
classifier is projected back into the image space. Best viewed in electronic version.

Figure 8. [LEFT]: Example images of unseen classes and their encyclopedia articles. [RIGHT]: Visualization of the predicted convolutional
filter w'_c for the unseen article on the left, using the conv model trained on Oxford Flower. The top 5 images with the highest
activations for the predicted conv filters in the validation dataset (including both seen and unseen classes) are shown. The filter
visualizations under the images are generated by the deconvolution technique. The highest activation in the predicted convolutional
classifier is projected back into the image space. Best viewed in electronic version.